
Conversation

@da-roth (Contributor) commented Jan 10, 2026

This PR integrates the Forge JIT backend for XAD, adding optional native code generation support. Forge is an optional dependency - everything builds and runs without it.

Changes

Build options added (see the example configure invocation after this list):

  • QLRISKS_ENABLE_FORGE: Enable Forge JIT backend
  • QLRISKS_ENABLE_FORGE_TESTS: Include Forge tests in test suite
  • QLRISKS_ENABLE_JIT_TESTS: Enable XAD JIT tests (interpreter backend, no Forge)
  • QLRISKS_BUILD_BENCHMARK / QLRISKS_BUILD_BENCHMARK_STANDALONE: Benchmark executables
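
All of these are standard CMake cache options, so a Forge-enabled build with the extra tests and benchmarks can be configured along these lines (paths and generator are illustrative):

```sh
cmake -S . -B build \
  -DQLRISKS_ENABLE_FORGE=ON \
  -DQLRISKS_ENABLE_FORGE_TESTS=ON \
  -DQLRISKS_ENABLE_JIT_TESTS=ON \
  -DQLRISKS_BUILD_BENCHMARK=ON
cmake --build build
```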

Files added:

  • test-suite/jit_xad.cpp: JIT infrastructure tests (interpreter backend)
  • test-suite/forgebackend_xad.cpp: Forge backend tests
  • test-suite/swaption_jit_pipeline_xad.cpp: JIT pipeline tests for LMM Monte Carlo
  • test-suite/swaption_benchmark.cpp: Boost.Test benchmarks
  • test-suite/benchmark_main.cpp: Standalone benchmark executable
  • test-suite/PlatformInfo.hpp: Platform detection utilities
  • .github/workflows/ql-benchmarks.yaml: Benchmark workflow

Files modified:

  • CMakeLists.txt: Forge integration options
  • test-suite/CMakeLists.txt: Conditional test/benchmark targets
  • .github/workflows/ci.yaml: Added forge-linux and forge-windows jobs

Benchmarks

The benchmark workflow (ql-benchmarks.yaml) runs swaption pricing benchmarks comparing FD, XAD tape, JIT scalar, and JIT-AVX methods on Linux and Windows.

Also included is some initial work towards #33: the workflow has type-overhead jobs that compare double vs. xad::AReal pricing performance (no derivatives) on the same hardware, providing a baseline for measuring XAD's type overhead. A sketch of that kind of comparison is below.
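
For illustration only - this is not the actual benchmark code - here is a minimal sketch of such a type-overhead comparison. The kernel, the sizes, and the names priceKernel/timeIt are made up, and it assumes xad::AReal<double> records nothing when no tape is active:

```cpp
#include <XAD/XAD.hpp>
#include <chrono>
#include <cmath>
#include <iostream>
#include <vector>

// Toy pricing kernel, templated on the scalar type so the identical code
// path is timed once with double and once with xad::AReal<double>.
template <class Real>
Real priceKernel(const std::vector<Real>& spots, int steps)
{
    using std::exp;  // for double; ADL picks up xad::exp for AReal
    Real acc(0.0);
    for (const Real& s : spots) {
        Real x = s;
        for (int i = 0; i < steps; ++i)
            x += exp(-x * 0.01) * 0.001;  // stand-in for pricing arithmetic
        acc += x;
    }
    return acc;
}

template <class Real>
double timeIt(int n, int steps)
{
    using xad::value;  // value() also passes through plain arithmetic types
    std::vector<Real> spots(n, Real(100.0));
    auto t0 = std::chrono::steady_clock::now();
    Real v = priceKernel(spots, steps);
    auto t1 = std::chrono::steady_clock::now();
    std::cout << "  result: " << value(v) << "\n";  // keep v observable
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main()
{
    std::cout << "double : " << timeIt<double>(100000, 50) << " ms\n";
    // No tape is created here, so nothing should be recorded - the intent
    // is to isolate the overhead of the active type itself.
    std::cout << "AReal  : " << timeIt<xad::AReal<double>>(100000, 50) << " ms\n";
}
```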

Example benchmark run (Linux): link

@da-roth (Contributor, Author) commented Jan 27, 2026

Hi @auto-differentiation-dev, thanks for the thorough review!

I refactored the benchmark tests quite a bit - I hope I've incorporated everything as requested. QL is now built three times (native double, XAD with JIT=OFF, XAD with JIT=ON) and a combined report is created:
[image: combined benchmark report]
Since JIT=OFF seemed to run into a bottleneck at 100k paths (my guess is a memory issue, since we record all paths and make a single adjoint call), I added XAD-Split to the JIT=OFF build as well. It does the same as in the JIT=ON case: use XAD for the bootstrap to build the Jacobian, then run XAD per path (a sketch of this split is below). Its performance scales as expected.
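
To make the split concrete, here is a minimal, self-contained sketch, assuming XAD's adjoint-mode tape API (xad::adj<double>); bootstrap() and pathPayoff() are toy stand-ins for the real curve bootstrap and LMM path simulation, and all sizes are made up:

```cpp
#include <XAD/XAD.hpp>
#include <iostream>
#include <vector>

using mode = xad::adj<double>;
using AD   = mode::active_type;  // xad::AReal<double>

// Toy stand-in for the curve bootstrap: m quotes -> n model parameters.
std::vector<AD> bootstrap(const std::vector<AD>& q)
{
    std::vector<AD> p(2);
    p[0] = q[0] + 0.5 * q[1];
    p[1] = q[1] * q[2];
    return p;
}

// Toy stand-in for one Monte Carlo path's discounted payoff.
AD pathPayoff(const std::vector<AD>& p, double z)
{
    return p[0] * exp(p[1] * z);
}

int main()
{
    std::vector<double> quotes = {0.01, 0.012, 0.015};  // market quotes q
    const std::size_t n = 2;                            // model parameters p

    mode::tape_type tape;

    // Stage 1: record the bootstrap once, extract the Jacobian J = dp/dq.
    std::vector<AD> q(quotes.begin(), quotes.end());
    for (auto& qi : q) tape.registerInput(qi);
    tape.newRecording();
    std::vector<AD> p = bootstrap(q);
    for (auto& pi : p) tape.registerOutput(pi);

    std::vector<std::vector<double>> J(n, std::vector<double>(q.size()));
    for (std::size_t i = 0; i < n; ++i) {
        tape.clearDerivatives();      // reset adjoints between sweeps
        derivative(p[i]) = 1.0;       // seed model parameter i
        tape.computeAdjoints();
        for (std::size_t j = 0; j < q.size(); ++j)
            J[i][j] = derivative(q[j]);
    }

    std::vector<double> pVal(n);
    for (std::size_t i = 0; i < n; ++i) pVal[i] = value(p[i]);

    // Stage 2: per-path adjoints on a short recording per path.
    const std::vector<double> draws = {-1.0, -0.3, 0.3, 1.0};  // fake RNG
    double v = 0.0;
    std::vector<double> dVdp(n, 0.0);
    for (double z : draws) {
        std::vector<AD> pp(pVal.begin(), pVal.end());
        for (auto& pi : pp) tape.registerInput(pi);
        tape.newRecording();          // tape now holds only this path
        AD payoff = pathPayoff(pp, z);
        tape.registerOutput(payoff);
        derivative(payoff) = 1.0;
        tape.computeAdjoints();
        v += value(payoff) / draws.size();
        for (std::size_t i = 0; i < n; ++i)
            dVdp[i] += derivative(pp[i]) / draws.size();
    }

    // Stage 3: chain rule, dV/dq = dV/dp * J.
    for (std::size_t j = 0; j < quotes.size(); ++j) {
        double g = 0.0;
        for (std::size_t i = 0; i < n; ++i) g += dVdp[i] * J[i][j];
        std::cout << "dV/dq[" << j << "] = " << g << "\n";
    }
    std::cout << "V = " << v << "\n";
}
```

The point is that the big bootstrap tape is recorded once, while each path only ever records a short tape, which keeps memory bounded.
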
Best, Daniel

@da-roth mentioned this pull request Feb 2, 2026
@da-roth (Contributor, Author) commented Feb 2, 2026

Hi @auto-differentiation-dev ,

this PR is now in a state for another round of review. I moved the overhead workflow into a new PR (#37), branched from this PR's branch, and will follow up on it once this one is merged.

Best, Daniel

@auto-differentiation-dev (Contributor) commented:

Hi @da-roth,

Sorry it took us some time to come back on this, and thank you for all the work - it is taking good shape.

For XAD, we think only the XAD-Split mode should be reported. This is the practical way to run Monte Carlo with XAD using path-wise derivatives, and it is what we would encourage users to adopt (see https://auto-differentiation.github.io/faq/#quant-finance-applications).

Looking at the results, we noticed a few patterns in the reported timings that seem unexpected and would be good to clarify. Overall, all methods show the expected linear behaviour - a fixed overhead plus a component that grows linearly with the number of paths. That said, the relative behaviour between methods looks unusual.

In particular:

  • The fixed cost for FD is extremely large (around 9s). Conceptually, this should correspond to the curve bootstrapping cost multiplied by the number of sensitivities, but even then it looks excessively high. Can you clarify what is driving this overhead?
  • The per-path cost of FD compared to XAD differs only by about a factor of 4, even though FD requires roughly 45 re-runs. Intuitively, we would expect a much larger gap here - how do you explain this behaviour?
  • Forge with AVX2 appears around 6x faster than Forge without AVX2 in terms of per-path scaling, despite AVX2 only providing 4 double-precision vector lanes. This seems stronger than what hardware vectorisation alone would suggest.
  • The reported curve bootstrap time with XAD of 223ms does not seem consistent with the overall trends implied by the data, where the fixed overhead looks closer to ~133ms, and Forge compilation appears to be around 120ms.

This makes us wonder whether there might be something off in how the timings are being measured or attributed. It would be helpful to understand how you interpret these trends.

Thanks again, and happy to discuss further.

Commits pushed: "fixes", "use original evolve for xad-split", "tried fixes", "first try", "added local script for fd and running benchmark locally".

@auto-differentiation-dev (Contributor) commented:

Hi @da-roth,

Looking at the numbers again, maybe it is just a matter of increasing the Monte Carlo workload so the runtime isn't dominated by bootstrapping. That would better reflect a real-world application. For example, we could include a portfolio of swaptions and additional cashflows.

@da-roth (Contributor, Author) commented Feb 3, 2026

Hi @auto-differentiation-dev,

I did some investigating - your remarks and intuitions were right. The QL code was doing some nasty re-computations of matrices during each step of the MC simulation. The code implementing this example is not optimal, but that makes it a really good working example for future work: it shows the impact of the double vs. AReal overhead, and the latest results also indicate where Forge is still not optimal. The pattern is sketched below.
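
Schematically, the issue and the fix look like this (hypothetical names - the real change is in the LMM evolution code):

```cpp
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Stand-in for the matrix the QL code was rebuilding on every step.
Matrix buildCovariance(std::size_t n)
{
    return Matrix(n, std::vector<double>(n, 0.01));
}

void evolvePath(std::vector<double>& state, std::size_t steps)
{
    // After the fix: built once per path (it does not change across steps).
    const Matrix cov = buildCovariance(state.size());
    for (std::size_t s = 0; s < steps; ++s) {
        // Before the fix, buildCovariance() was effectively called here,
        // repeating identical work on every step of every path.
        for (std::size_t i = 0; i < state.size(); ++i)
            for (std::size_t j = 0; j < state.size(); ++j)
                state[i] += cov[i][j] * 0.5;  // toy state update
    }
}
```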

Anyway, I did a minor optimization so that the matrix is only computed once per path, and I see this locally:
Timings, native double FD (in ms):

| Paths | Method | Mean    | StdDev |
|-------|--------|---------|--------|
| 1K    | FD     | 4820.3  | 10.0   |
| 10K   | FD     | 6781.9  | 0.0    |
| 100K  | FD     | 26416.9 | 0.0    |

Timings, AReal with JIT = ON (in ms):

| Paths | Method    | Mean   | StdDev | Setup* | Speedup vs XAD |
|-------|-----------|--------|--------|--------|----------------|
| 1K    | XAD       | 180.0  | 0.8    | ---    | ---            |
|       | XAD-Split | 143.5  | 2.4    | 109.5  | 1.25x          |
|       | JIT       | 190.7  | 5.0    | 128.7  | 0.94x          |
|       | JIT-AVX   | 143.8  | 0.5    | 128.7  | 1.25x          |
| 10K   | XAD       | 847.8  | 4.0    | ---    | ---            |
|       | XAD-Split | 449.3  | 1.5    | 110.9  | 1.89x          |
|       | JIT       | 721.7  | 0.0    | 127.9  | 1.17x          |
|       | JIT-AVX   | 258.5  | 1.6    | 127.9  | 3.28x          |
| 100K  | XAD       | 7435.5 | 0.0    | ---    | ---            |
|       | XAD-Split | 3463.5 | 0.0    | 110.7  | 2.15x          |
|       | JIT       | 6060.8 | 0.0    | 127.6  | 1.23x          |
|       | JIT-AVX   | 1413.2 | 0.0    | 127.6  | 5.26x          |

\* Setup: one-time setup cost (Jacobian bootstrap for XAD-Split; Forge compilation for JIT).

So even with the AAD/AReal overhead (amplified, of course, by the suboptimal implementation in QL), XAD still gives roughly a 7.5x benefit over native-double FD for this example (26416.9 ms / 3463.5 ms ≈ 7.6x at 100K paths).

Interestingly, XAD-Split is faster than scalar JIT - I think this shows both how well XAD is optimized and the cost of the unnecessary computations still present in the JIT path.

The gap between JIT and JIT-AVX doesn't surprise me too much - I spent some time improving the throughput of setting inputs and getting outputs for the AVX path. So it is effectively 4 lanes plus some infrastructural improvements that could apply to the scalar JIT too; I'll port these to the scalar JIT in the future.

Let's see how the benchmarks look in the cloud. Would you like any changes here? I really like this example since it gives us all the insights for future improvements, but of course one could construct something with a higher speedup over native FD if desired (more inputs, digging further into avoiding unnecessary computations, etc.). Thinking out loud, my intuition is: the better XAD performs relative to native FD, the nearer scalar JIT will get to XAD-Split (and at some point it should be slightly faster, as we saw in the XAD repo's results).

Cheers, Daniel
